we've trained an ai model[1] that compresses sequences of token embeddings into shorter sequences of token embeddings, from which it then attempts to reconstruct the original text, with varying degrees of success.
the main dial we can turn is the number of "embedding tokens" used to represent a text. here's how it works:
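roughly: a text's token embeddings get pooled down into k "embedding tokens", and the decoder then tries to regenerate the original text conditioned only on those k vectors. the toy sketch below is only meant to show the shapes involved; every module name and size in it is made up for illustration and none of it is the real model.

```python
import torch
import torch.nn as nn

# toy sketch only, not the real architecture: token embeddings (batch, seq, d)
# get pooled into k "embedding tokens" (batch, k, d); a decoder would then try
# to reconstruct the original text from those k vectors alone.

D = 256  # embedding width (made up for illustration)
K = 4    # the dial: how many embedding tokens the text is squeezed into

class ToyCompressor(nn.Module):
    def __init__(self, d: int = D, k: int = K):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(k, d))  # learned "embedding token" slots
        self.pool = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq, d) -> (batch, k, d)
        q = self.queries.unsqueeze(0).expand(token_embeds.size(0), -1, -1)
        compressed, _ = self.pool(q, token_embeds, token_embeds)
        return compressed

toy = ToyCompressor()
fake_text = torch.randn(1, 128, D)  # stand-in for a 128-token text
print(toy(fake_text).shape)         # torch.Size([1, 4, 256])
# the (1, 4, 256) output is what the decoder would be conditioned on
# (e.g. as prefix embeddings) to regenerate the original 128 tokens.
```

the fewer embedding tokens you allow, the lossier the reconstruction gets.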
the ultimate purpose of this architecture is to serve as a component in a system for farming compute from oblivious human minds over the internet, not text compression[2]. but while we cook up the other parts, this is pretty fun to play with.
here's a text generated by midori[3], encoded and decoded with various counts of embedding tokens:
here are a few more examples of texts and their compressed-then-decompressed versions:
you can also embed a text, add that text's embedding to another text's embedding, and attempt a decoding, which can result in some wonderfully weird outputs[4].
you can access the model used here via huggingface, at midwestern-simulation/essence-3b-v1.2.
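using it probably looks something like the sketch below, but treat the encode/decode calls as placeholders: the repo ships its own code, and the method names and arguments here are assumptions, so check the model card for the real interface.

```python
from transformers import AutoModel

# placeholder sketch: the repo ships custom code, and the encode()/decode()
# calls below are hypothetical stand-ins for whatever it actually exposes.
repo = "midwestern-simulation/essence-3b-v1.2"
model = AutoModel.from_pretrained(repo, trust_remote_code=True)

# hypothetical: compress two texts into 4 embedding tokens each
a = model.encode("a text about foxes", num_embedding_tokens=4)
b = model.encode("a text about the ocean", num_embedding_tokens=4)

print(model.decode(a))      # hypothetical: reconstruct the first text
print(model.decode(a + b))  # hypothetical: embedding arithmetic, then decode the sum
```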
- crumb, dove
[1] a decoder transformer, ironically
[2] if lossless compression using lms is your goal, you'd be better off with entropy coding. this is something else.
[3] a creative model trained by dove
[4] they might be less weird, or more coherent, if you trained a vae on embeddings from this model (which we plan to do): encode the embeddings with the vae before the arithmetic, then decode with the vae and finally decode with the llm, putting the embeddings handed to the decoder back onto the manifold of realistic embeddings.
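a toy sketch of what [4] is gesturing at (all names and sizes made up, nothing here is trained): a small vae over the compressor's embedding tokens, so the arithmetic happens in the vae's latent space and the result gets decoded back toward realistic embeddings before it ever reaches the llm decoder.

```python
import torch
import torch.nn as nn

# toy sketch of footnote [4]: a tiny vae over (K, D) embedding tokens, so that
# embedding arithmetic can be done in latent space and the sum decoded back
# onto the "realistic embeddings" manifold. sizes are illustrative only.
D, K, LATENT = 256, 4, 64

class EmbeddingVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(K * D, 512), nn.GELU())
        self.mu, self.logvar = nn.Linear(512, LATENT), nn.Linear(512, LATENT)
        self.dec = nn.Sequential(nn.Linear(LATENT, 512), nn.GELU(), nn.Linear(512, K * D))

    def encode(self, e):              # e: (batch, K, D) embedding tokens
        h = self.enc(e)
        return self.mu(h), self.logvar(h)

    def decode(self, z):              # z: (batch, LATENT)
        return self.dec(z).view(-1, K, D)

vae = EmbeddingVAE()
a, b = torch.randn(1, K, D), torch.randn(1, K, D)
mu_a, _ = vae.encode(a)               # at inference we just take the mean;
mu_b, _ = vae.encode(b)               # training would add reparameterization + a kl term
mixed = vae.decode(mu_a + mu_b)       # arithmetic in latent space, then decode
print(mixed.shape)                    # torch.Size([1, 4, 256])
# `mixed` is what would be handed to the llm decoder instead of the raw sum a + b.
```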